Introduction to Open Data Science - Course Project

About the project

This course, Introduction to Open Data Science 2020, builds on 1) openness of science and data, and 2) data science (e.g., wrangling, examining, analyzing, and visualizing data). Because data science also includes collaborating with others and using openly available software tools and services (e.g., R, RStudio, Git, GitHub), openness is the key word in this course.

The course has 6 main topics:
1. Introduction
2. Regression and model validation
3. Logistic regression
4. Clustering and classification
5. Dimensionality reduction techniques
6. Analysis of longitudinal data

Each week has deadlines for 1) submitting exercises, and 2) peer reviewing submissions from a few of the other students.

The MOOC course platform contains lots of relevant information: instructions for the course, discussion forums, lecture videos and other videos, exercises, a place for submissions, and links to other helpful resources (including books). Lectures (and smaller group sessions) take place on Mondays!

Overall, this course will introduce ways to become better at interpreting data, presenting findings, and sharing knowledge.

Thoughts during the first week

I really enjoyed the first Zoom webinar and found the DataCamp videos extremely informative. The first part, “Hello R”, clarified the way R works, although its content was familiar to me. Most of the topics in the second part, “Getting hooked on R”, were new to me.

I have not used GitHub before, so I really appreciated the clear instructions on the course site and the opportunity to watch the webinar about GitHub and RStudio.

I am very excited about the upcoming weeks, particularly about getting more familiar with GitHub and learning to use R better. I expect the R part to be time consuming, occasionally frustrating, and rewarding.


Regression analysis (week 2)

#Loading libraries
library(dplyr)
library(ggplot2)
library(GGally)
library(tidyverse)

Loading and describing the data

getwd()
## [1] "C:/Users/Pinja/Documents/GitHub/IODS-project"
#Setting the working directory to the data folder, then reading the modified data into R from my local folder and saving it into lrn2014
setwd('./data') 
lrn2014 <- read.csv('learning2014.csv')

#examining the data
dim(lrn2014) #166 observations and 8 variables (one is an additional id-variable)
## [1] 166   8
str(lrn2014)
## 'data.frame':    166 obs. of  8 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
#Reading the data from online to make sure it is correct

lrn14 <-read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep=",", header=TRUE)

dim(lrn14) #the data has 166 observations and 7 variables
## [1] 166   7
str(lrn14) #the variables are: gender, age, attitude, deep, stra, surf, and points
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
summary(lrn14)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00
glimpse(lrn14)
## Rows: 166
## Columns: 7
## $ gender   <chr> "F", "M", "F", "M", "M", "F", "M", "F", "M", "F", "M", "F"...
## $ age      <int> 53, 55, 49, 53, 49, 38, 50, 37, 37, 42, 37, 34, 34, 34, 35...
## $ attitude <dbl> 3.7, 3.1, 2.5, 3.5, 3.7, 3.8, 3.5, 2.9, 3.8, 2.1, 3.9, 3.8...
## $ deep     <dbl> 3.583333, 2.916667, 3.500000, 3.500000, 3.666667, 4.750000...
## $ stra     <dbl> 3.375, 2.750, 3.625, 3.125, 3.625, 3.625, 2.250, 4.000, 4....
## $ surf     <dbl> 2.583333, 3.166667, 2.250000, 2.250000, 2.833333, 2.416667...
## $ points   <int> 25, 12, 24, 10, 22, 21, 21, 31, 24, 26, 31, 31, 23, 25, 21...


The data is a modified and shortened version of a longer data set examining students’ approaches to learning and study skills on a statistics course in 2014.

The shortened data contains 166 participants, and the following variables:

  • Gender (male, female)
  • Age (from 17 to 55 years)
  • Attitude = general attitude towards statistics (ranging between 1.4-5)
  • Deep = a sum variable measuring a deep way of learning (1.6-4.9)
  • Stra = a sum variable measuring a strategic, organized way of learning (1.3-5)
  • Surf = a sum variable measuring a surface, shallow way of learning (1.6-4.3)
  • Points = exam points, how well one did in a statistics exam (7-33)

The values of age and points are integers; the other numerical variables have decimals.

Examining the data further

glimpse(lrn14) #gender is the first column
## Rows: 166
## Columns: 7
## $ gender   <chr> "F", "M", "F", "M", "M", "F", "M", "F", "M", "F", "M", "F"...
## $ age      <int> 53, 55, 49, 53, 49, 38, 50, 37, 37, 42, 37, 34, 34, 34, 35...
## $ attitude <dbl> 3.7, 3.1, 2.5, 3.5, 3.7, 3.8, 3.5, 2.9, 3.8, 2.1, 3.9, 3.8...
## $ deep     <dbl> 3.583333, 2.916667, 3.500000, 3.500000, 3.666667, 4.750000...
## $ stra     <dbl> 3.375, 2.750, 3.625, 3.125, 3.625, 3.625, 2.250, 4.000, 4....
## $ surf     <dbl> 2.583333, 3.166667, 2.250000, 2.250000, 2.833333, 2.416667...
## $ points   <int> 25, 12, 24, 10, 22, 21, 21, 31, 24, 26, 31, 31, 23, 25, 21...
#The pairs function gave an error about gender having an 'invalid color name'.
#Let's try with gender recoded so that M = 1 and F = 2.

lrn142 <- lrn14 %>% 
  mutate(gender_num = recode(gender, "M"="1", "F"="2")) 
#recoding it also to be numeric
lrn142 <- lrn142 %>%
  mutate(gender_num = as.numeric(gender_num))

gender_mean <- mean(lrn142$gender_num) #named gender_mean to avoid shadowing base::mean
gender_mean #mean is 1.66; it would be 1.5 with equal numbers of men and women, so there are more women in the data
## [1] 1.662651
#pairs(lrn14[c(-1)], col=lrn14$gender_num) # I used this to check colours -> red is 2=females, black is 1= males

#a scatter plot of the variables (gender only shown with colours)
pairs(lrn142[c(-1,-8)], col=lrn142$gender_num) 

#Let's make a plot with scatter plots, correlations, and distributions
plot_c <- ggpairs(lrn14, 1:7, progress = ggmatrix_progress(clear=T), mapping = aes(col=gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)), proportions = "auto")

plot_c

#Let's make and print the same plot but without the genders separated
plot <- ggpairs(lrn14, 2:7, progress = ggmatrix_progress(clear=T), lower = list(combo = wrap("facethist", bins = 20)), proportions = "auto")
plot

summary(lrn14) #shows number of observations for character variables, and range and values, such as mean, for numerical variables.
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00



Based on the plots and summary information:

  • People are mainly young (the age variable is skewed to the right, mean 25.5 years)
  • The variables attitude, stra, and surf are quite normally distributed
  • The deep variable is a bit skewed to the left
  • The points variable is somewhat skewed to the left (mean = 22.7 out of 33 points)
  • The plots show that the data contains more females than males, and indicate some gender differences. For instance, males may have a more positive attitude towards statistics.

The three strongest correlations in the whole data are between:

  1. attitude and points (r= .44)
  2. deep and surf (r= -.32)
  3. attitude and surf (r= -.18)
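These top correlations can also be picked out programmatically by ranking the absolute values of the correlation matrix. A minimal sketch, run on R’s built-in mtcars data as a stand-in (the helper strongest_cors is hypothetical, not part of the course code):

```r
# Rank the strongest pairwise correlations in a numeric data frame.
strongest_cors <- function(df, n = 3) {
  cm <- cor(df)
  cm[upper.tri(cm, diag = TRUE)] <- NA            # keep each pair only once
  ord <- order(abs(cm), decreasing = TRUE, na.last = NA)
  idx <- arrayInd(ord[1:n], dim(cm))
  data.frame(var1 = rownames(cm)[idx[, 1]],
             var2 = colnames(cm)[idx[, 2]],
             r    = cm[ord[1:n]])
}

# Demonstration on built-in data (mtcars stands in for lrn14 here)
strongest_cors(mtcars[, c("mpg", "wt", "hp", "disp")])
```

Applying the same function to the numeric columns of lrn14 (columns 2 to 7) would reproduce the ranking above.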

Making a regression model that predicts exam points from attitude, strategic learning, and gender

#Fitting a linear model
r3_model <- lm(points ~ attitude + stra + gender, data = lrn14)
summary(r3_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + gender, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.7179  -3.3285   0.5343   3.7412  10.9007 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9798     2.4030   3.737 0.000258 ***
## attitude      3.5100     0.5956   5.893 2.13e-08 ***
## stra          0.8911     0.5441   1.638 0.103419    
## genderM      -0.2236     0.9248  -0.242 0.809231    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.304 on 162 degrees of freedom
## Multiple R-squared:  0.2051, Adjusted R-squared:  0.1904 
## F-statistic: 13.93 on 3 and 162 DF,  p-value: 3.982e-08
#Neither strategic learning nor gender was significantly (p > .05, even > 0.1) related to exam points. At this significance level we cannot reliably reject the null hypothesis of the t-test (which states that no relation exists between a particular x variable and the y variable).
#Let's refit the model with only the significant predictor, i.e. attitude.
r1_model <- lm(points ~ attitude, data = lrn14)
summary(r1_model)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The new model shows that mathematically:
points = 11.64 + 3.53 * attitude (+ e)

The intercept is 11.64, which is the predicted value of y when attitude is 0. This is not very meaningful to interpret here (it just shows that y is not 0 when x is 0). However, the x variable can be centered so that its mean is zero; in that case the intercept is the predicted (mean) y value at the mean of x.
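
To illustrate the centering point, here is a small sketch on simulated data (the numbers mimic the fitted model but are not the lrn14 data): with a mean-centered predictor, the slope is unchanged and the intercept becomes the mean of y.

```r
# Simulated stand-in data, not the actual lrn14 data
set.seed(1)
attitude_sim <- runif(166, 1.4, 5)
points_sim   <- 11.64 + 3.53 * attitude_sim + rnorm(166, sd = 5)

fit_raw      <- lm(points_sim ~ attitude_sim)
fit_centered <- lm(points_sim ~ I(attitude_sim - mean(attitude_sim)))

coef(fit_raw)[1]       # intercept: predicted y at attitude = 0
coef(fit_centered)[1]  # intercept: predicted y at the mean attitude
mean(points_sim)       # equals the centered intercept
```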

The beta value of attitude, 3.53, is quite strong: for each one-point increase in the attitude variable, the predicted exam points go up by about 3.5 points. That is, the more positive one's attitude towards statistics, the better one is likely to do in the exam.

Both the intercept and attitude are significant at p < .001. This means that H0 is rejected for attitude and H1 can be accepted, i.e. there is a relationship between attitude and points. For the intercept, the significant t-test means that the intercept is not 0.

In short, of attitude, strategic learning, and gender, only attitude significantly (p < .001) predicted exam points: each one-point increase in the attitude variable predicts about 3.5 more exam points.

Other model information from the model summary:

  • While the t-tests show the significance of a specific variable, the F-test shows whether the model as a whole (i.e., at least one of the predictors) explains a significant share of the variance in y. Since the model contains a significant predictor, the F-test is also significant (p < .001).
  • The R squared (now both adjusted R-squared and the basic R-squared) is 0.19. This means that the attitude variable is able to explain 19% of the variation (variance) in the exam points.
    • The adjusted R-squared is overall a better measure than R-squared because it penalizes for the number of predictors (adding predictors can increase plain R-squared simply by chance)
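
For reference, the adjustment penalizes each added predictor:

adjusted R² = 1 − (1 − R²) × (n − 1) / (n − p − 1)

With n = 166 observations and p = 1 predictor this gives 1 − (1 − 0.1906) × 165/164 ≈ 0.186, which matches the 0.1856 in the summary above up to the rounding of R².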
#making a plot to view the final model
qplot(attitude, points, data = lrn14) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Creating diagnostic plots and examining model assumptions

#Plots of residuals vs fitted values (1), a normal QQ-plot (2), and residuals vs. leverage (5)
par(mfrow = c(2, 2))
plot(r1_model, which = c(1, 2, 5))

A linear regression model makes (among other things) the following assumptions, which should hold for its predictions to be valid and reliable:

  • The error terms (i.e., residuals) of the model are normally distributed
  • The residuals have constant variance across all values of the x variables
  • The data does not contain troublesome outliers
  • There is a linear relationship between the y variable and the x variables

From the plots we see that:

  • The normality of the errors is probably acceptable, although some deviation from normality exists at both ends (QQ-plot)
  • The residuals seem to have a constant variance (the plot of residuals vs. fitted values)
  • A couple of observations have leverage > 0.04. The model is probably acceptable without removing any of them






Logistic regression analysis (week3)

#loading libraries
library(tidyverse) #includes dplyr, tidyr, tibble, and ggplot2
library(GGally)
library(plotly)
library(foreign)
library(corrgram)
library(psych)

1. General information

#Reading in the joined (student alcohol consumption) data from online (due to the comment in the discussion forum about the correct Ns - I had too many... although that seems to be the case in the online link too, oh well.)

alc <-read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep=",", header=TRUE)

glimpse(alc) #382 participants, 35 variables.
## Rows: 382
## Columns: 35
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G...
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, ...
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "...
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", ...
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,...
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "ser...
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other...
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation...
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "...
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "ye...
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mothe...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,...
## $ failures   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,...
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no",...
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye...
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes...
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", ...
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", ...
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,...
## $ absences   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, ...
## $ G1         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 1...
## $ G2         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 1...
## $ G3         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, ...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
head(alc,6)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
## 2     GP   F  17       U     GT3       T    1    1  at_home    other     course
## 3     GP   F  15       U     LE3       T    1    1  at_home    other      other
## 4     GP   F  15       U     GT3       T    4    2   health services       home
## 5     GP   F  16       U     GT3       T    3    3    other    other       home
## 6     GP   M  16       U     LE3       T    4    3 services    other reputation
##   nursery internet guardian traveltime studytime failures schoolsup famsup paid
## 1     yes       no   mother          2         2        0       yes     no   no
## 2      no      yes   father          1         2        0        no    yes   no
## 3     yes      yes   mother          1         2        3       yes     no  yes
## 4     yes      yes   mother          1         3        0        no    yes  yes
## 5     yes       no   father          1         2        0        no    yes  yes
## 6     yes      yes   mother          1         2        0        no    yes  yes
##   activities higher romantic famrel freetime goout Dalc Walc health absences G1
## 1         no    yes       no      4        3     4    1    1      3        6  5
## 2         no    yes       no      5        3     3    1    1      3        4  5
## 3         no    yes       no      4        3     2    2    3      3       10  7
## 4        yes    yes      yes      3        2     2    1    1      5        2 15
## 5         no    yes       no      4        3     2    1    2      5        4  6
## 6        yes    yes       no      5        4     2    1    2      5       10 15
##   G2 G3 alc_use high_use
## 1  6  6     1.0    FALSE
## 2  5  6     1.0    FALSE
## 3  8 10     2.5     TRUE
## 4 14 15     1.0    FALSE
## 5 10 10     1.5    FALSE
## 6 15 15     1.5    FALSE
  • Information about the data
    • participants (382) are secondary school students from two Portuguese
      schools
    • data contains background variables (e.g., age, sex, school), school and social life variables (e.g., number of school absences, alcohol use), and information about grades in mathematics and Portuguese language
    • for a comprehensive description of the dataset see: https://archive.ics.uci.edu/ml/datasets/Student+Performance

The variables that will be examined here

  • The target (dependent) variable:
    • high/low alcohol consumption
  • The predictors:
    • studytime
    • failures
    • absences (from school)
    • G3, i.e., the average of the final grades in Portuguese and math

The Hypotheses:

Alcohol consumption will be…

  1. negatively related to weekly study time and final school grade
  2. positively related to school absences and failed classes

2. Looking into the variable distributions and their relationships

# choosing only the variables of interest
alc_s <- alc %>%
select(studytime,failures,absences,G3, alc_use, high_use) 

#transforming data into long format for plotting
alc_sl <- alc_s %>%
  gather(var, measure, -high_use)

#Visualising the distributions
alc_sl %>%
  ggplot(aes(measure)) + geom_density() +
  facet_wrap(~var, scales = "free") + #free scales on both axes; passing "free_y" as a separate argument caused warnings
  theme_bw()

#Looking into means and other summary statistics
summary(alc_s)
##    studytime        failures         absences            G3       
##  Min.   :1.000   Min.   :0.0000   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.: 8.00  
##  Median :2.000   Median :0.0000   Median : 3.000   Median :11.00  
##  Mean   :2.034   Mean   :0.2906   Mean   : 5.319   Mean   :10.39  
##  3rd Qu.:2.000   3rd Qu.:0.0000   3rd Qu.: 8.000   3rd Qu.:14.00  
##  Max.   :4.000   Max.   :3.0000   Max.   :75.000   Max.   :20.00  
##     alc_use       high_use      
##  Min.   :1.000   Mode :logical  
##  1st Qu.:1.000   FALSE:270      
##  Median :1.500   TRUE :112      
##  Mean   :1.877                  
##  3rd Qu.:2.500                  
##  Max.   :5.000
#the percentage in the high use (and not high use) categories
prop.table(table(alc$high_use))
## 
##     FALSE      TRUE 
## 0.7068063 0.2931937
#Looking into variable relationships
ggpairs(alc_s,1:5, progress = ggmatrix_progress(clear=T), mapping = aes(alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)), proportions = "auto")

# a visual correlation matrix
corrgram(alc_s, order=T, cor.method="spearman")

#examining the predictors separately among students with high vs low alcohol use, red is higher use.
ggpairs(alc_s, 1:4, progress = ggmatrix_progress(clear=T), mapping = aes(col=high_use, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20, size=2.0)), proportions = "auto")

Comments about the variables and the hypotheses

  • Distributions:
    • G3 is quite normally distributed
    • most common studytime is 2-5 hrs a week (value 2)
    • the majority has no class failures or absences
    • alcohol use is skewed to the right, i.e. mainly low: 71% of the students are in the low alcohol use category and 29% in the high use category
  • Relations between the predictors and alcohol use:
    • The correlations are quite small but in the directions predicted by the hypotheses, except for G3, which is practically unrelated to alcohol use.

3. Fitting a logistic regression model

#A logistic regression model for predicting alcohol consumption (high vs low) with the chosen four predictors
m <- glm(high_use ~ failures + absences + studytime + G3, data = alc, family = "binomial")

# printing out the model summary
summary(m)
## 
## Call:
## glm(formula = high_use ~ failures + absences + studytime + G3, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5848  -0.8046  -0.6663   1.1027   2.2046  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.43155    0.46417  -0.930 0.352522    
## failures     0.34172    0.16495   2.072 0.038296 *  
## absences     0.06262    0.01759   3.560 0.000371 ***
## studytime   -0.53006    0.15863  -3.341 0.000833 ***
## G3           0.01105    0.02852   0.388 0.698313    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 423.47  on 377  degrees of freedom
## AIC: 433.47
## 
## Number of Fisher Scoring iterations: 4
#creating coefficients as odds ratios
OR <- coef(m) %>% exp

# computing confidence intervals for the odds ratios
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
#combining and printing the ORs and CIs
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.6495044 0.2590351 1.6065887
## failures    1.4073732 1.0190133 1.9541493
## absences    1.0646266 1.0308109 1.1041295
## studytime   0.5885684 0.4271793 0.7966834
## G3          1.0111158 0.9568568 1.0703968
#testing: would G3 be significant as the only predictor
m2 <- glm(high_use ~ G3, data = alc, family = "binomial") 
summary(m2) #no, not significant
## 
## Call:
## glm(formula = high_use ~ G3, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8933  -0.8349  -0.8122   1.5495   1.6227  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.71272    0.26843  -2.655  0.00793 **
## G3          -0.01621    0.02379  -0.681  0.49572   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 461.75  on 380  degrees of freedom
## AIC: 465.75
## 
## Number of Fisher Scoring iterations: 4

3.1 Interpreting the model:

The summary shows that three of the predictors significantly predict low vs. high alcohol use. Compared to those with low alcohol use, lower study time (p < .001), more absences (p < .001), and more class failures (p < .05) are more typical of those with high alcohol use. Final grade did not significantly differ between the alcohol use groups. Except for the grade variable, the relations are as stated in the hypotheses.

The odds ratios (OR) and their confidence intervals show that, with 95% confidence, the odds ratio for…

  • studytime is between 0.43 and 0.80
  • absences is between 1.03 and 1.10
  • failures is between 1.02 and 1.95

In addition, G3 had an odds ratio of 1.01 and its CI crosses 1. This matches the fact that it does not significantly predict the alcohol use category: both groups are equally likely (OR = 1) to get specific grades. The OR of absences is also close to 1 (1.06), but its CI stays above 1, so each additional absence has a small yet statistically significant effect on the odds of high use.
Overall, OR < 1 means that the examined feature is less likely in the “high” group, and OR > 1 means that the feature is more likely in the “high” group.
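
As a quick numeric check of this interpretation, the coefficients from the model summary above can be exponentiated by hand (a sketch; the values are the printed estimates):

```r
# Coefficients on the log-odds scale, taken from the model summary
b_absences  <- 0.06262
b_studytime <- -0.53006

exp(b_absences)   # OR per extra absence, ~1.065 (> 1: raises the odds of high use)
exp(b_studytime)  # OR per studytime step, ~0.589 (< 1: lowers the odds)

# Ten extra absences multiply the odds by exp(10 * b), not by 10 * exp(b):
exp(10 * b_absences)  # ~1.87
```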

3.2 The predictive power of a modified model

#choosing the statistically significant variables from the first model, and making a new model with them
m_f <- glm(high_use ~ failures + absences + studytime, data = alc, family = "binomial")

#Exploring the predictive power of this modified model

# predicting the probability of high_use
probabilities <- predict(m_f, type = "response")

# adding the predicted probabilities to the data
alc <- mutate(alc, probability = probabilities)

#predicting high_use with the probabilities
alc <- mutate(alc, prediction = probability > 0.5)

select(alc, failures, absences, studytime, high_use, probability, prediction) %>% tail(10) #two misclassifications can be seen with these 10 observations (prediction based on model probabilities vs. actual high_use)
##     failures absences studytime high_use probability prediction
## 373        1        0         1    FALSE   0.3714011      FALSE
## 374        1       14         1     TRUE   0.5879619       TRUE
## 375        0        2         3    FALSE   0.1450987      FALSE
## 376        0        7         1    FALSE   0.4011299      FALSE
## 377        1        0         3    FALSE   0.1702142      FALSE
## 378        0        0         2    FALSE   0.2025250      FALSE
## 379        1        0         2    FALSE   0.2582354      FALSE
## 380        1        0         2    FALSE   0.2582354      FALSE
## 381        0        3         1     TRUE   0.3423836      FALSE
## 382        0        0         1     TRUE   0.3011900      FALSE
# Creating a 2x2 cross tabulation of the target variable, i.e. actual case of high alcohol use, versus the predicted high vs. low use.
table(high_use = alc$high_use, prediction = alc$prediction) #This gives numbers
##         prediction
## high_use FALSE TRUE
##    FALSE   256   14
##    TRUE     89   23
# The same tabulation but in percentages, and with error sums
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67015707 0.03664921 0.70680628
##    TRUE  0.23298429 0.06020942 0.29319372
##    Sum   0.90314136 0.09685864 1.00000000
# Making a plot to visualise the prediction accuracy (the false positives and false negatives that the model creates are colored with blue in the plot)
g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction)) + geom_point()
g

#computing the average number of incorrect predictions.
# 1) a loss function i.e. define mean prediction error
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# 2)computing incorrectly classified cases for our model
loss_func(class = alc$high_use, prob = alc$probability) #27% cases incorrectly classified. This can also be obtained by combining the wrongly classified percentages from the cross-tabulation table (0.23 + 0.037)
## [1] 0.2696335

Based on the cross-tabulations of actual and predicted alcohol use class, and the value obtained from the loss function, the model has an accuracy rate of (100 − 27) = 73%. That is, 27% of cases are classified into the wrong consumption class. Since random guessing between the two classes would give an accuracy rate of about 50%, the model seems to be an improvement over this (though note that always guessing “low” would already be right about 71% of the time, so the gain over that simple strategy is more modest).
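
The same accuracy and error figures can be reproduced directly from the counts in the cross-tabulation above (a sketch; the matrix is typed in from the table rather than recomputed):

```r
# Confusion matrix counts from the table above (rows: actual, cols: predicted)
confusion <- matrix(c(256, 89, 14, 23), nrow = 2,
                    dimnames = list(high_use = c("FALSE", "TRUE"),
                                    prediction = c("FALSE", "TRUE")))
n <- sum(confusion)                  # 382 students
accuracy <- sum(diag(confusion)) / n # (256 + 23) / 382, ~0.73
error    <- 1 - accuracy             # (14 + 89) / 382, ~0.27
c(accuracy = accuracy, error = error)
```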

3.3 Lastly, some cross-validation of the model

# Performing a 10-fold cross-validation on the modified model
library(boot)
## 
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
## 
##     logit
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m_f, K = 10)
# average number of wrong predictions, i.e. prediction error (The cv.glm function saves the error in delta)
cv$delta[1] 
## [1] 0.2905759

The model has a prediction error of about 0.29, which is not smaller (better) than the DataCamp one of 0.26.

Could I find a better model? Perhaps, by looking at all the variables’ correlations with the alcohol use variable and choosing the ones that correlate most strongly with it. How much better that model would be shall remain a mystery for now.
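
For experimenting with such alternative models, the same cv.glm pattern works for any fitted glm. A self-contained sketch on R’s built-in mtcars data (am ~ wt is just an illustrative model, not a proposal for the alc data):

```r
library(boot)

# Same loss function as above: mean share of misclassified cases
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# Toy logistic model on built-in data
m_toy <- glm(am ~ wt, data = mtcars, family = "binomial")

set.seed(123)  # K-fold CV splits the data randomly
cv_toy <- cv.glm(data = mtcars, cost = loss_func, glmfit = m_toy, K = 10)
cv_toy$delta[1]  # estimated out-of-sample prediction error
```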





4. Clustering and classification (week 4)

4.1. Information about the data

#loading the MASS package
library(MASS)
#loading the data from it
data("Boston")


dim(Boston)#506 obs, 14 var.
## [1] 506  14
str(Boston) # all values are numerical, 3 have integer values and the rest have decimals.
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
head(Boston,10)
##       crim   zn indus chas   nox    rm   age    dis rad tax ptratio  black
## 1  0.00632 18.0  2.31    0 0.538 6.575  65.2 4.0900   1 296    15.3 396.90
## 2  0.02731  0.0  7.07    0 0.469 6.421  78.9 4.9671   2 242    17.8 396.90
## 3  0.02729  0.0  7.07    0 0.469 7.185  61.1 4.9671   2 242    17.8 392.83
## 4  0.03237  0.0  2.18    0 0.458 6.998  45.8 6.0622   3 222    18.7 394.63
## 5  0.06905  0.0  2.18    0 0.458 7.147  54.2 6.0622   3 222    18.7 396.90
## 6  0.02985  0.0  2.18    0 0.458 6.430  58.7 6.0622   3 222    18.7 394.12
## 7  0.08829 12.5  7.87    0 0.524 6.012  66.6 5.5605   5 311    15.2 395.60
## 8  0.14455 12.5  7.87    0 0.524 6.172  96.1 5.9505   5 311    15.2 396.90
## 9  0.21124 12.5  7.87    0 0.524 5.631 100.0 6.0821   5 311    15.2 386.63
## 10 0.17004 12.5  7.87    0 0.524 6.004  85.9 6.5921   5 311    15.2 386.71
##    lstat medv
## 1   4.98 24.0
## 2   9.14 21.6
## 3   4.03 34.7
## 4   2.94 33.4
## 5   5.33 36.2
## 6   5.21 28.7
## 7  12.43 22.9
## 8  19.15 27.1
## 9  29.93 16.5
## 10 17.10 18.9

The dataset contains information related to housing in the suburbs of Boston. There are 506 rows and 14 columns (variables) in the data. A few examples of variable measures are:

  • per capita crime rate by town
  • proportion of blacks by town
  • median value of owner-occupied homes

More information about the data can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

Let’s take a closer look at the variables

#loading packages
library(tidyverse)
library(corrplot)
library(GGally)

#a graphical overview of the data
#lets do pairs on parts of data at a time to get a clearer picture
pairs(Boston[1:7])

pairs(Boston[7:14])

#looking at distributions and correlations
ggpairs(Boston,lower = list(combo = wrap("facethist", bins = 20)),
             upper = list(continuous = wrap("cor", size = 1.5)))

# calculating the correlation matrix and rounding it
cor_matrix<-cor(Boston) %>% round(2)
#looking into the correlations
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualizing the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d",tl.cex=0.6)

#show summaries of the variables
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
#two additional histograms
hist(Boston$black)

hist(Boston$lstat)

  • About the distributions
    • Most variables are skewed. For instance: crim (mostly low crime rates), zn (low proportion of residential land zoned for lots over 25,000 sq.ft.), chas (most tracts do not bound the river), and black (the distribution is heavily skewed toward its maximum value). In addition, most owner-occupied units were built prior to 1940 (the age variable shows this as a proportion), and in most suburbs only a small share of the population is of lower status (lstat).
  • About the variable relations
    • The data contains quite many correlations that are above |.50|.
    • Some of the strongest correlations in the data are between:
      • rad and tax (r = .91)
      • nox and indus (r= .76)
      • dis and age (r = -.75)
      • medv and lstat (r= -.74)
      • dis and indus (r= -.71)
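The strongest pairs listed above can also be extracted programmatically from the correlation matrix, for example:

```r
library(MASS)  # for the Boston data

# List all variable pairs with |r| > 0.7, keeping each pair only once
cm <- cor(Boston)
cm[upper.tri(cm, diag = TRUE)] <- NA          # blank out duplicates and the diagonal
strong <- which(abs(cm) > 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[strong[, 1]],
           var2 = colnames(cm)[strong[, 2]],
           r    = round(cm[strong], 2))
```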

4.2. Modifying the data

Standardizing the variables

#centering and standardizing the variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled) #matrix
## [1] "matrix" "array"
# transforming it into a data frame
boston_scaled <- as.data.frame(boston_scaled)

In the standardized (and centered) data, all variable means are now zero and every variable has a standard deviation of one. The minimum and maximum values also changed (the minimum is now negative), and the ranges of many variables shrank because each column was divided by its standard deviation.
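As a quick sanity check, scale() applies the familiar standardization formula (x − mean) / sd column by column:

```r
library(MASS)  # for the Boston data

# Verify scale() against the manual standardization formula on one column
x <- Boston$crim
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)  # TRUE
```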

Creating a categorical variable of the scaled crime rate

#creating a quantile vector of crim and print it
q <- quantile(boston_scaled$crim)
q
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
#Using the quantiles as the break points for the new categorical variable
crime <- cut(boston_scaled$crim, breaks = q, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

#looking into the new factor crime variable
table(crime) #seems fine
## crime
##      low  med_low med_high     high 
##      127      126      126      127
#adding the crime variable to the scaled data
boston_scaled <- data.frame(boston_scaled, crime)

#dropping the original crim variable from the data
boston_scaled <- dplyr::select(boston_scaled, -crim)
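On a small toy vector, the same quantile-based cut() recoding looks like this; by construction, quartile breaks put an (almost) equal number of cases into each class:

```r
# cut() with quantile breaks, mirroring the crime recoding above
v <- 1:12
qv <- quantile(v)
cats <- cut(v, breaks = qv, include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
table(cats)  # three cases per class
```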

Dividing the data into train and test sets

#separating the data so that 80% of the data belongs to the train set (and 20% to the test set)

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# randomly choosing 80% of the rows (floor() guards against the non-integer size 0.8 * 506 = 404.8)
training <- sample(n, size = floor(n * 0.8))

# create the train set
train <- boston_scaled[training,]

# create the test set 
test <- boston_scaled[-training,]

dim(test) #102 obs
## [1] 102  14
dim(train) #404 obs
## [1] 404  14
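One caveat about the split above: sample() is random and no seed was set, so the split (and every downstream result) changes on each knit. A sketch of a reproducible version (the seed value is arbitrary):

```r
# Reproducible 80/20 split of 506 rows
set.seed(2020)  # arbitrary fixed seed for reproducibility
n <- 506
training <- sample(n, size = floor(n * 0.8))
length(training)                 # 404 rows for the train set
length(setdiff(1:n, training))   # 102 rows left for the test set
```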

4.3. Linear discriminant analysis

#Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. 

#fitting the LDA
lda.fit <- lda(crime ~ ., data = train)
#printing the model object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2475248 0.2500000 0.2500000 0.2524752 
## 
## Group means:
##                  zn      indus        chas        nox          rm        age
## low       0.9530028 -0.8680718 -0.07547406 -0.8694260  0.47338101 -0.8685929
## med_low  -0.1217232 -0.2431746 -0.03844192 -0.5372942 -0.13287060 -0.3174653
## med_high -0.3836558  0.1861685  0.11748284  0.3714056  0.05709808  0.4088008
## high     -0.4872402  1.0171096 -0.11793298  1.0835551 -0.43919419  0.8130653
##                 dis        rad        tax     ptratio      black       lstat
## low       0.8891078 -0.6844182 -0.7331747 -0.43397563  0.3785785 -0.76829799
## med_low   0.2849371 -0.5406779 -0.4363085 -0.01741929  0.3113557 -0.14807598
## med_high -0.3607598 -0.4337909 -0.3299184 -0.29273319  0.1130580  0.04406341
## high     -0.8538611  1.6382099  1.5141140  0.78087177 -0.8523589  0.92083793
##                medv
## low       0.5342109
## med_low  -0.0250977
## med_high  0.2025891
## high     -0.7450622
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.10968814  0.730949398 -0.88967173
## indus    0.03566585 -0.182727481  0.15434524
## chas    -0.02326417 -0.004070901  0.04921607
## nox      0.43521232 -0.777535396 -1.35198712
## rm       0.02055519  0.055590579 -0.13331260
## age      0.25878023 -0.272380780 -0.16780923
## dis     -0.03826764 -0.276581379 -0.07609605
## rad      3.47040716  0.863018610 -0.21908040
## tax     -0.08266086  0.043940245  0.74143794
## ptratio  0.16171480 -0.019746517 -0.22934277
## black   -0.12173775  0.061005808  0.16489425
## lstat    0.19338101 -0.283219415  0.17765894
## medv     0.08191788 -0.561022102 -0.38577006
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9567 0.0323 0.0110
# target classes as numeric
target <- as.numeric(train$crime)

#Draw the LDA (bi)plot
plot(lda.fit, dimen = 2, col=target, pch=target)

#Save the (correct) crime categories from the test set 
correct_crime <- test$crime

# remove the categorical crime variable from the test dataset
test <- dplyr::select(test, -crime)
#
#Predict the classes with the LDA model on the test data. 
lda.pred <- predict(lda.fit, newdata = test)
#
#Cross tabulate the results with the crime categories from the test set. 
table(correct = correct_crime, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16      10        1    0
##   med_low    3      19        3    0
##   med_high   0       6       17    2
##   high       0       0        0   25

Note that the exact counts vary between knits because the train/test split is random (no seed was set). Based on the table above: the LDA predicted all high crime rate cases correctly, and most of the med_high, med_low and low crime rate cases correctly as well. Still, a notable share of the low cases (over 40%) and of the med_high cases (about a third) was misclassified. The plot also shows visually that the high crime rate cases are clearly separate from most of the other classes’ cases.

The lda.fit object also shows (proportion of trace) that the first discriminant function explains about 96% of the between-class variance captured by the model, the second about 3%, and the third about 1%. On the first dimension, rad (= accessibility to radial highways) separates the crime rate classes the most, and the group means show that it is clearly highest in the high crime rate areas.
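The overall test-set accuracy can also be read off the cross-tabulation; with the counts shown above (which vary between knits), a quick calculation gives about 75%:

```r
# Confusion matrix counts copied from the table above (rows = correct class,
# columns = predicted class); they change whenever the random split changes.
conf <- matrix(c(16, 10,  1,  0,
                  3, 19,  3,  0,
                  0,  6, 17,  2,
                  0,  0,  0, 25),
               nrow = 4, byrow = TRUE)
sum(diag(conf)) / sum(conf)  # proportion of correct predictions, ~0.75
```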

4.4. Clustering the data

#Reload the Boston dataset
data("Boston")

#standardize the dataset (to get comparable distances below)
#saving the scaled data as a data frame
boston_scaled2 <- scale(Boston) %>% as.data.frame()

#looking into last 10 observations
tail(boston_scaled2,10)
##           crim         zn      indus       chas       nox         rm
## 497 -0.3864333 -0.4872402 -0.2108898 -0.2723291 0.2615253 -1.2732886
## 498 -0.3889003 -0.4872402 -0.2108898 -0.2723291 0.2615253 -0.6982955
## 499 -0.3923020 -0.4872402 -0.2108898 -0.2723291 0.2615253 -0.3780642
## 500 -0.3994275 -0.4872402 -0.2108898 -0.2723291 0.2615253 -1.0185268
## 501 -0.3940157 -0.4872402 -0.2108898 -0.2723291 0.2615253 -0.3666782
## 502 -0.4128204 -0.4872402  0.1156240 -0.2723291 0.1579678  0.4388814
## 503 -0.4148387 -0.4872402  0.1156240 -0.2723291 0.1579678 -0.2343159
## 504 -0.4130378 -0.4872402  0.1156240 -0.2723291 0.1579678  0.9839863
## 505 -0.4073609 -0.4872402  0.1156240 -0.2723291 0.1579678  0.7249547
## 506 -0.4145899 -0.4872402  0.1156240 -0.2723291 0.1579678 -0.3624084
##             age        dis        rad        tax  ptratio     black       lstat
## 497  0.15365093 -0.4732098 -0.4076377 -0.1022751 0.343873 0.4406159  1.18846991
## 498  0.07194248 -0.4285218 -0.4076377 -0.1022751 0.343873 0.4406159  0.20262208
## 499 -0.11634223 -0.6581830 -0.4076377 -0.1022751 0.343873 0.4406159  0.03738054
## 500  0.17496618 -0.6625521 -0.4076377 -0.1022751 0.343873 0.4282384  0.34265729
## 501  0.39522376 -0.6158695 -0.4076377 -0.1022751 0.343873 0.4406159  0.23483018
## 502  0.01865435 -0.6251775 -0.9818712 -0.8024176 1.175303 0.3868341 -0.41773387
## 503  0.28864751 -0.7159308 -0.9818712 -0.8024176 1.175303 0.4406159 -0.50035464
## 504  0.79666096 -0.7729187 -0.9818712 -0.8024176 1.175303 0.4406159 -0.98207574
## 505  0.73626775 -0.6677760 -0.9818712 -0.8024176 1.175303 0.4028263 -0.86444617
## 506  0.43430172 -0.6126402 -0.9818712 -0.8024176 1.175303 0.4406159 -0.66839688
##            medv
## 497 -0.30801068
## 498 -0.46023251
## 499 -0.14491587
## 500 -0.54721641
## 501 -0.62332733
## 502 -0.01444002
## 503 -0.21015379
## 504  0.14865480
## 505 -0.05793197
## 506 -1.15610373
#and a summary
summary(boston_scaled2)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
#Calculate the distances between the observations.
#First,euclidean distance:
#matrix
dist_eu <- dist(boston_scaled2) #euclidean distance is the default
#summary
summary(dist_eu) # median distance is 4.8 (out of 14.4)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
#Second,manhattan distance:
# matrix
dist_man <- dist(boston_scaled2, method="manhattan")
#summary 
summary(dist_man) # median distance is 12.6 (out of 48.9)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
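For intuition on the two metrics: between two 2-D points, the Euclidean distance is the straight-line distance, while the Manhattan distance sums the absolute coordinate differences.

```r
# Euclidean vs Manhattan distance between the points (0, 0) and (3, 4)
p <- rbind(c(0, 0), c(3, 4))
dist(p)                        # euclidean (default): sqrt(3^2 + 4^2) = 5
dist(p, method = "manhattan")  # |3| + |4| = 7
```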
#Run the k-means algorithm on the dataset (starting with 4 clusters)
km <- kmeans(boston_scaled2, centers = 4)

#Investigate what is the optimal number of clusters and run the algorithm again. 

#looking at how the total within-cluster sum of squares (TWCSS) behaves as the number of clusters changes; a good number of clusters is where the TWCSS drops sharply

#setting a seed so that the (random) initial cluster centers are reproducible
set.seed(123)
#Let's check how the TWCSS changes across cluster counts from 1 to 10.
k_max <- 10 #the max number of clusters
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss}) #calculating the TWCSS for each k

#examining the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

#The biggest drop comes when moving from 1 to 2 clusters, so 2 clusters seems to be the best choice.

#Let's rerun the k-means clustering with 2 centers
km_f <- kmeans(boston_scaled2, centers = 2)

#plotting (visualising) the clusters (in two subsets to see the results clearer)
pairs(boston_scaled2[1:7], col =km_f$cluster)

pairs(boston_scaled2[8:14], col =km_f$cluster)

#transforming km_f into a factor for the visualization
km_f$cluster <- as.factor(km_f$cluster)

#some further visualization with the two subsets
ggpairs(boston_scaled2[1:7],lower = list(combo = wrap("facethist", bins = 20)),upper = list(continuous = wrap("cor", size = 1.5)),  mapping=ggplot2::aes(colour = km_f$cluster, alpha=0.5))

ggpairs(boston_scaled2[8:14],lower = list(combo = wrap("facethist", bins = 20)),upper = list(continuous = wrap("cor", size = 1.5)),  mapping=ggplot2::aes(colour = km_f$cluster, alpha=0.5))

The plots indicate that most of the variables are distributed differently across the two clusters, although it is difficult to say which variables differ significantly between them. At least crim (crime rate), indus (proportion of non-retail business acres), nox (nitrogen oxides concentration), lstat (proportion of lower status population), tax (property tax rate), rad (accessibility to radial highways), dis (distance to employment centres), and age appear to separate the clusters. Overall, the clusters seem to partially differentiate between higher- and lower-class neighborhoods.


Bonus:

  1. Fitting an additional K-means cluster analysis with 3 clusters to the standardized data,
    and 2. Performing an LDA using the clusters as target classes
#Performing the cluster analysis with 3 clusters
km3 <- kmeans(boston_scaled2, centers = 3)

#making the target variable numeric
clusters <- as.numeric(km3$cluster)
#adding the target variable into the data
boston_scaled2 <- data.frame(boston_scaled2, clusters)


#OBS! I had to comment out the code below because lda() gave the error "variable 4 appears to be constant within groups" when knitting. The likely cause is that the binary chas variable (column 4 of the data) is constant within one of the three clusters, which lda() cannot handle. The text at the end is based on the results I got before knitting the document.

#Including all the variables in the Boston data as predictors in the LDA
#lda.fit2 <- lda(clusters ~ ., data = boston_scaled2)

#printing the LDA object
#lda.fit2

#Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution).

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "blue", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

#creating the plot
#plot(lda.fit2, dimen = 2, col=clusters, pch=clusters)
#lda.arrows(lda.fit2, myscale = 4.5)
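A hedged guess at the error above: with three clusters, some predictor (probably the binary chas) is constant within one group, which lda() refuses. A sketch of a guard that drops such variables before fitting, shown here on toy data (the real fix would filter boston_scaled2 the same way):

```r
# Drop predictors that are (near-)constant within some group -- the condition
# that triggers lda()'s "appears to be constant within groups" error.
set.seed(42)
d <- data.frame(x = rnorm(30), y = rnorm(30), const = rep(0, 30))  # toy data
g <- rep(1:3, each = 10)                                           # toy cluster labels
keep <- sapply(d, function(v) all(tapply(v, g, sd) > 1e-8))
names(d)[keep]  # "x" "y" -- const would be dropped before calling lda()
```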

Based on the LDA object and the plot, the most influential linear separators for the clusters are nox, tax, zn, and medv. Some other variables, such as age, also help to separate the clusters. Still, because the first discriminant function is the most important one (explaining 85% of the explained variance between the clusters), the variables loading on it most strongly separate the clusters best, i.e. nox (nitrogen oxides concentration) and tax (property tax). Looking at the group means, group 1 seems to contain the “worst” neighborhoods, group 2 the “best”, and group 3 something in between.




5. Dimensionality reduction techniques (week5)

5.1 Loading in data

#Loading in the data from my local data folder, row.names=1 keeps the country names as row names.
setwd("./data")
human_ <- read.csv("humanf.csv", row.names=1)

head(human_,4)
##             edu2_ftom labour_ftom Edu_exp life_exp   GNI mat_mor Adol_birth
## Norway      1.0072389   0.8908297    17.5     81.6 64992       4        7.8
## Australia   0.9968288   0.8189415    20.2     82.4 42261       6       12.1
## Switzerland 0.9834369   0.8251001    15.8     83.0 56431       6        1.9
## Denmark     0.9886128   0.8840361    18.7     80.2 44025       5        5.1
##             parl_f
## Norway        39.6
## Australia     30.5
## Switzerland   28.5
## Denmark       38.0
dim(human_) # seems good
## [1] 155   8
#loading libraries
library(GGally)
library(ggplot2)
library(corrplot)
library(tidyverse)

5.2 A graphical overview of the data and summaries of the variables

#variable summaries
summary(human_)
##    edu2_ftom       labour_ftom        Edu_exp         life_exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            mat_mor         Adol_birth         parl_f     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
#a graphical overview of the data
ggpairs(human_,lower = list(combo = wrap("facethist", bins = 20)), upper = list(continuous = wrap("cor", size = 3)))

#looking more into the correlations
# calculating the correlation matrix and rounding it
cor_matrix<-cor(human_) %>% round(2)
#looking into the correlations
cor_matrix
##             edu2_ftom labour_ftom Edu_exp life_exp   GNI mat_mor Adol_birth
## edu2_ftom        1.00        0.01    0.59     0.58  0.43   -0.66      -0.53
## labour_ftom      0.01        1.00    0.05    -0.14 -0.02    0.24       0.12
## Edu_exp          0.59        0.05    1.00     0.79  0.62   -0.74      -0.70
## life_exp         0.58       -0.14    0.79     1.00  0.63   -0.86      -0.73
## GNI              0.43       -0.02    0.62     0.63  1.00   -0.50      -0.56
## mat_mor         -0.66        0.24   -0.74    -0.86 -0.50    1.00       0.76
## Adol_birth      -0.53        0.12   -0.70    -0.73 -0.56    0.76       1.00
## parl_f           0.08        0.25    0.21     0.17  0.09   -0.09      -0.07
##             parl_f
## edu2_ftom     0.08
## labour_ftom   0.25
## Edu_exp       0.21
## life_exp      0.17
## GNI           0.09
## mat_mor      -0.09
## Adol_birth   -0.07
## parl_f        1.00
# visualizing the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d",tl.cex=0.6)

Descriptions of the variables:

  • edu2_ftom = Proportion of females / proportion of males with at least secondary education
  • labour_ftom = Proportion of females / proportion of males in the labour force
  • Edu_exp = Expected years of schooling
  • life_exp = Life expectancy at birth
  • GNI = Gross National Income per capita
  • mat_mor = Maternal mortality ratio
  • Adol_birth = Adolescent birth rate
  • parl_f = Percentage of female representatives in parliament

Some observations about the data:

  • In most countries, maternal mortality, GNI, and the adolescent birth rate are relatively low; these distributions are strongly right-skewed.
  • The female-to-male ratio of secondary education attainment is fairly normally distributed across countries, although in most countries men still attain at least secondary education more often than women.
  • Expected years of schooling is also roughly normally distributed.
  • Some of the strongest correlations are between 1) maternal mortality and life expectancy (r= -0.86), 2) life expectancy and expected years of schooling (r = 0.79), and 3) adolescent birth rate and maternal mortality (r= 0.76)

5.3 Perform principal component analysis (PCA) and draw a biplot displaying the observations by the first two principal components

#PCA (with the SVD method)
pca_human <- prcomp(human_)
#printing the summary object and a summary of it
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation (n x k) = (8 x 8):
##                       PC1           PC2           PC3           PC4
## edu2_ftom   -5.607472e-06  0.0006713951 -3.412027e-05 -2.736326e-04
## labour_ftom  2.331945e-07 -0.0002819357  5.302884e-04 -4.692578e-03
## Edu_exp     -9.562910e-05  0.0075529759  1.427664e-02 -3.313505e-02
## life_exp    -2.815823e-04  0.0283150248  1.294971e-02 -6.752684e-02
## GNI         -9.999832e-01 -0.0057723054 -5.156742e-04  4.932889e-05
## mat_mor      5.655734e-03 -0.9916320120  1.260302e-01 -6.100534e-03
## Adol_birth   1.233961e-03 -0.1255502723 -9.918113e-01  5.301595e-03
## parl_f      -5.526460e-05  0.0032317269 -7.398331e-03 -9.971232e-01
##                       PC5           PC6           PC7           PC8
## edu2_ftom   -0.0022935252  2.180183e-02  6.998623e-01  7.139410e-01
## labour_ftom  0.0022190154  3.264423e-02  7.132267e-01 -7.001533e-01
## Edu_exp      0.1431180282  9.882477e-01 -3.826887e-02  7.776451e-03
## life_exp     0.9865644425 -1.453515e-01  5.380452e-03  2.281723e-03
## GNI         -0.0001135863 -2.711698e-05 -8.075191e-07 -1.176762e-06
## mat_mor      0.0266373214  1.695203e-03  1.355518e-04  8.371934e-04
## Adol_birth   0.0188618600  1.273198e-02 -8.641234e-05 -1.707885e-04
## parl_f      -0.0716401914 -2.309896e-02 -2.642548e-03  2.680113e-03
summary(pca_human) #this shows clearly the proportion of variance captured by each component. The first PC explains 99.9 % of the variance that is captured. It also shows the standard deviations.
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
#Drawing a biplot
biplot(pca_human, choices = 1:2, cex= c(1,1), col=c("orange", "navy"))

Interpreting and commenting on the results:

  • The first principal component explains 99.9 % of the variance that is captured, and the second PC in practice covers the little variance that is left.
  • Looking at the PCA results and the biplot it seems that (pretty much) the only variable which mattered for explaining the variance is GNI, i.e. countries’ Gross National Income per capita.
  • This is understandable in light of the variable distributions: GNI had by far the largest variance and range with its max value being over 100,000!
  • In sum, without standardizing the variables, PC1 that mainly reflects GNI is the most important underlying dimension when explaining the variance in the data.
  • Of the countries, Qatar stands out as an extreme outlier in the biplot, because it has by far the highest GNI in the data (over 120,000)
#Let's draw the plot again but name the PC1 as "Low standard of living", see: http://hdr.undp.org/sites/default/files/hdr2015_technical_notes.pdf

biplot(pca_human, choices = 1:2, cex= c(1,1), col=c("orange", "navy"), xlab = "Low standard of living")
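The dominance of GNI can be demonstrated on toy data: prcomp() on unstandardized data is driven by raw variances, so a single high-variance column swamps everything else.

```r
# One column with a huge variance captures essentially all of PC1 in an
# unstandardized PCA (column names here are illustrative).
set.seed(7)
d <- data.frame(small = rnorm(50), big = rnorm(50, sd = 1000))
p <- prcomp(d)
summary(p)$importance["Proportion of Variance", "PC1"]  # ~1
```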

5.4 Standardizing the variables and running the PCA again

#standardizing the data
human_std <- scale(human_)

summary(human_std) #seems good.
##    edu2_ftom        labour_ftom         Edu_exp           life_exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             mat_mor          Adol_birth          parl_f       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
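As a reminder of what scale() does (and why every Mean above is exactly 0): each column is centered by its mean and divided by its standard deviation. A quick base-R check on a toy vector:

```r
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)            # (x - mean) / sd by hand
stopifnot(all.equal(as.numeric(scale(x)), manual))
mean(scale(x))                              # 0 (up to floating point)
```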
#PCA (with the SVD method)
pca_humanstd <- prcomp(human_std)

pca_humanstd
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                     PC1         PC2         PC3         PC4        PC5
## edu2_ftom   -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## labour_ftom  0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## Edu_exp     -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## life_exp    -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## GNI         -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## mat_mor      0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## Adol_birth   0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## parl_f      -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                     PC6         PC7         PC8
## edu2_ftom    0.17713316  0.05773644  0.16459453
## labour_ftom -0.03500707 -0.22729927 -0.07304568
## Edu_exp     -0.38606919  0.77962966 -0.05415984
## life_exp    -0.42242796 -0.43406432  0.62737008
## GNI          0.11120305 -0.13711838 -0.16961173
## mat_mor      0.17370039  0.35380306  0.72193946
## Adol_birth  -0.76056557 -0.06897064 -0.14335186
## parl_f       0.13749772  0.00568387 -0.02306476
summary(pca_humanstd) #a much more reasonable/useful result: the first PC explains 54% of the variance, the 2nd 16% and the 3rd 10%, etc.
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
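The “Proportion of Variance” row is simply each component’s variance (the squared sdev) divided by the total variance; a quick check on toy data (summary() stores these proportions rounded to five decimals):

```r
set.seed(42)
p <- prcomp(matrix(rnorm(40), ncol = 4))
prop <- p$sdev^2 / sum(p$sdev^2)   # variance share of each PC, computed by hand
round(prop, 5) == summary(p)$importance["Proportion of Variance", ]  # all TRUE
```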
#Drawing the biplot
biplot(pca_humanstd, choices = 1:2, cex= c(0.8,0.9), col=c("lightpink", "darkblue"))

Interpreting and commenting on the results:

  • With the standardized variables the PCA results are a lot more informative (and reliable!). This is because PCA is based on variances: variables measured on large scales get far more weight in the components than their actual importance would justify. Standardizing the variables is therefore usually essential before running a PCA.
  • The results show that the first PC covers 54% of the variance and the second PC 16%.
  • The biplot and the loadings show that the first dimension mainly reflects life_exp (neg.), Edu_exp (neg.), GNI (neg.), edu2_ftom (neg.), Adol_birth (pos.) and mat_mor (pos.). It could thus be named “Low standard of living and education”.
  • The second dimension (PC2) mainly captures labour_ftom (pos.) and parl_f (pos.) and could thus be termed “Equality”.
  • The biplot shows that certain countries can be set quite clearly apart from the rest based on these two components. Similar countries also cluster together; for instance, some European countries (Belgium, Netherlands) are located in the upper left quadrant of the plot with quite high equality.
#Let's draw the plot again so that PC1 is named "Low standard of living and education" and PC2 is named "Equality".

biplot(pca_humanstd, choices = 1:2, cex= c(1,1), col=c("orange", "navy"), xlab = "Low standard of living and education", ylab= "Equality")

5.5 Loading and exploring a new dataset

#Load the tea dataset from the package Factominer.
library(FactoMineR)
data("tea")

dim(tea) #300 obs, 36 var. 
## [1] 300  36
head(tea,6) #the variables are mostly categorical, shown here as their text-format factor levels
##       breakfast     tea.time     evening     lunch     dinner     always home
## 1     breakfast Not.tea time Not.evening Not.lunch Not.dinner Not.always home
## 2     breakfast Not.tea time Not.evening Not.lunch Not.dinner Not.always home
## 3 Not.breakfast     tea time     evening Not.lunch     dinner Not.always home
## 4 Not.breakfast Not.tea time Not.evening Not.lunch     dinner Not.always home
## 5     breakfast Not.tea time     evening Not.lunch Not.dinner     always home
## 6 Not.breakfast Not.tea time Not.evening Not.lunch     dinner Not.always home
##       work     tearoom     friends     resto     pub       Tea   How    sugar
## 1 Not.work Not.tearoom Not.friends Not.resto Not.pub     black alone    sugar
## 2 Not.work Not.tearoom Not.friends Not.resto Not.pub     black  milk No.sugar
## 3     work Not.tearoom     friends     resto Not.pub Earl Grey alone No.sugar
## 4 Not.work Not.tearoom Not.friends Not.resto Not.pub Earl Grey alone    sugar
## 5 Not.work Not.tearoom Not.friends Not.resto Not.pub Earl Grey alone No.sugar
## 6 Not.work Not.tearoom Not.friends Not.resto Not.pub Earl Grey alone No.sugar
##       how       where           price age sex          SPC         Sport age_Q
## 1 tea bag chain store       p_unknown  39   M       middle     sportsman 35-44
## 2 tea bag chain store      p_variable  45   F       middle     sportsman 45-59
## 3 tea bag chain store      p_variable  47   F other worker     sportsman 45-59
## 4 tea bag chain store      p_variable  23   M      student Not.sportsman 15-24
## 5 tea bag chain store      p_variable  48   M     employee     sportsman 45-59
## 6 tea bag chain store p_private label  21   M      student     sportsman 15-24
##   frequency     escape.exoticism     spirituality     healthy     diuretic
## 1     1/day Not.escape-exoticism Not.spirituality     healthy Not.diuretic
## 2     1/day     escape-exoticism Not.spirituality     healthy     diuretic
## 3    +2/day Not.escape-exoticism Not.spirituality     healthy     diuretic
## 4     1/day     escape-exoticism     spirituality     healthy Not.diuretic
## 5    +2/day     escape-exoticism     spirituality Not.healthy     diuretic
## 6     1/day Not.escape-exoticism Not.spirituality     healthy Not.diuretic
##       friendliness     iron.absorption     feminine     sophisticated
## 1 Not.friendliness Not.iron absorption Not.feminine Not.sophisticated
## 2 Not.friendliness Not.iron absorption Not.feminine Not.sophisticated
## 3     friendliness Not.iron absorption Not.feminine Not.sophisticated
## 4 Not.friendliness Not.iron absorption Not.feminine     sophisticated
## 5     friendliness Not.iron absorption Not.feminine Not.sophisticated
## 6 Not.friendliness Not.iron absorption Not.feminine Not.sophisticated
##      slimming    exciting    relaxing    effect.on.health
## 1 No.slimming No.exciting No.relaxing No.effect on health
## 2 No.slimming    exciting No.relaxing No.effect on health
## 3 No.slimming No.exciting    relaxing No.effect on health
## 4 No.slimming No.exciting    relaxing No.effect on health
## 5 No.slimming No.exciting    relaxing No.effect on health
## 6 No.slimming No.exciting    relaxing No.effect on health
str(tea) # variables are of type factor and most of them have two levels. The data measures matters related to tea consumption.
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
#visualizing the data with bar plots
#first, selecting parts of the data
tea1 <- dplyr::select(tea, 1:13)
tea2 <- dplyr::select(tea, 14:25)
tea3 <- dplyr::select(tea, 26:36)
#drawing the plots 
gather(tea1) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar () + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

gather(tea2) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar () + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

gather(tea3) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar () + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

5.6 Multiple Correspondence Analysis (MCA) on the subdata tea3

head(tea3,4)
##       spirituality healthy     diuretic     friendliness     iron.absorption
## 1 Not.spirituality healthy Not.diuretic Not.friendliness Not.iron absorption
## 2 Not.spirituality healthy     diuretic Not.friendliness Not.iron absorption
## 3 Not.spirituality healthy     diuretic     friendliness Not.iron absorption
## 4     spirituality healthy Not.diuretic Not.friendliness Not.iron absorption
##       feminine     sophisticated    slimming    exciting    relaxing
## 1 Not.feminine Not.sophisticated No.slimming No.exciting No.relaxing
## 2 Not.feminine Not.sophisticated No.slimming    exciting No.relaxing
## 3 Not.feminine Not.sophisticated No.slimming No.exciting    relaxing
## 4 Not.feminine     sophisticated No.slimming No.exciting    relaxing
##      effect.on.health
## 1 No.effect on health
## 2 No.effect on health
## 3 No.effect on health
## 4 No.effect on health
#MCA
mca <- MCA(tea3, graph = TRUE)

summary(mca) #the MCA produces 11 dimensions; the 1st explains 15.3% of the variance, the 2nd 12.8%, the 3rd 11.1% and the 4th 10.5%.
## 
## Call:
## MCA(X = tea3, graph = TRUE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.153   0.128   0.111   0.105   0.095   0.085   0.077
## % of var.             15.296  12.806  11.108  10.516   9.499   8.511   7.694
## Cumulative % of var.  15.296  28.102  39.210  49.726  59.225  67.736  75.430
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.070   0.069   0.057   0.050
## % of var.              7.049   6.851   5.714   4.955
## Cumulative % of var.  82.480  89.331  95.045 100.000
## 
## Individuals (the 10 first)
##                        Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                   |  0.659  0.947  0.380 | -0.343  0.307  0.103 |  0.573
## 2                   |  0.368  0.294  0.115 |  0.088  0.020  0.007 |  0.903
## 3                   |  0.137  0.041  0.030 | -0.446  0.518  0.315 |  0.297
## 4                   |  0.173  0.066  0.030 | -0.545  0.773  0.293 | -0.161
## 5                   |  0.244  0.130  0.062 | -0.207  0.112  0.045 | -0.119
## 6                   |  0.568  0.702  0.307 | -0.652  1.108  0.406 |  0.312
## 7                   |  0.568  0.702  0.307 | -0.652  1.108  0.406 |  0.312
## 8                   | -0.127  0.035  0.029 | -0.378  0.372  0.261 | -0.338
## 9                   |  0.610  0.810  0.362 | -0.299  0.233  0.087 | -0.215
## 10                  |  0.137  0.041  0.030 | -0.446  0.518  0.315 |  0.297
##                        ctr   cos2  
## 1                    0.987  0.288 |
## 2                    2.449  0.697 |
## 3                    0.264  0.139 |
## 4                    0.078  0.026 |
## 5                    0.043  0.015 |
## 6                    0.292  0.093 |
## 7                    0.292  0.093 |
## 8                    0.343  0.208 |
## 9                    0.139  0.045 |
## 10                   0.264  0.139 |
## 
## Categories (the 10 first)
##                        Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test
## Not.spirituality    |  0.222  2.004  0.108  5.673 |  0.004  0.001  0.000  0.109
## spirituality        | -0.486  4.393  0.108 -5.673 | -0.009  0.002  0.000 -0.109
## healthy             | -0.351  5.115  0.287 -9.262 | -0.286  4.074  0.191 -7.563
## Not.healthy         |  0.818 11.936  0.287  9.262 |  0.668  9.507  0.191  7.563
## diuretic            | -0.406  5.687  0.228 -8.253 |  0.210  1.815  0.061  4.266
## Not.diuretic        |  0.561  7.853  0.228  8.253 | -0.290  2.506  0.061 -4.266
## friendliness        | -0.171  1.407  0.122 -6.051 |  0.060  0.208  0.015  2.129
## Not.friendliness    |  0.715  5.870  0.122  6.051 | -0.252  0.868  0.015 -2.129
## iron absorption     | -0.759  3.538  0.066 -4.455 | -0.275  0.554  0.009 -1.614
## Not.iron absorption |  0.087  0.408  0.066  4.455 |  0.032  0.064  0.009  1.614
##                        Dim.3    ctr   cos2 v.test  
## Not.spirituality    |  0.208  2.434  0.095  5.328 |
## spirituality        | -0.456  5.335  0.095 -5.328 |
## healthy             |  0.258  3.812  0.155  6.813 |
## Not.healthy         | -0.602  8.894  0.155 -6.813 |
## diuretic            |  0.324  4.977  0.145  6.580 |
## Not.diuretic        | -0.447  6.873  0.145 -6.580 |
## friendliness        | -0.160  1.687  0.107 -5.647 |
## Not.friendliness    |  0.667  7.040  0.107  5.647 |
## iron absorption     |  0.580  2.841  0.039  3.402 |
## Not.iron absorption | -0.067  0.327  0.039 -3.402 |
## 
## Categorical variables (eta2)
##                       Dim.1 Dim.2 Dim.3  
## spirituality        | 0.108 0.000 0.095 |
## healthy             | 0.287 0.191 0.155 |
## diuretic            | 0.228 0.061 0.145 |
## friendliness        | 0.122 0.015 0.107 |
## iron.absorption     | 0.066 0.009 0.039 |
## feminine            | 0.304 0.027 0.058 |
## sophisticated       | 0.198 0.039 0.233 |
## slimming            | 0.236 0.052 0.038 |
## exciting            | 0.020 0.340 0.046 |
## relaxing            | 0.037 0.347 0.216 |
#Based on the eta squared values, the variables that best capture the underlying dimensions are: feminine and healthy on the first dimension, exciting and relaxing on the second, and sophisticated and relaxing on the third. Thus people view tea drinking differently and certain views tend to occur together (for instance viewing tea drinking as feminine and healthy). These dimensions also differentiate between respondents.

#drawing a plot
plot(mca, invisible=c("ind"), habillage = "quali")

#The plot further shows the relations between the different variables (closer together = more related) as well as between the variables and the dimensions.
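The eta squared values above can be read as the share of a dimension’s variance explained by each categorical variable, i.e. the R-squared from regressing the dimension scores on that factor. A base-R sketch of the idea with toy data (not the tea set):

```r
set.seed(3)
grp <- factor(rep(c("a", "b"), each = 25))        # a two-level factor
score <- ifelse(grp == "a", 1, -1) + rnorm(50)    # dimension score related to grp
eta2 <- summary(lm(score ~ grp))$r.squared        # R-squared = eta squared here
eta2
```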






6. Analysis of longitudinal data



This chapter follows MABS, i.e. Multivariate Analysis for the Behavioral Sciences (2nd edition) by Vehkalahti & Everitt (2019).

Loading two datasets into R (short descriptions are given later)

#libraries
library(dplyr)
library(tidyr)
library(gtsummary)
library(ggplot2)

#loading the BPRS_l into R
BPRS_l <- read.csv("BPRS_l.csv", row.names=1)

dim(BPRS_l)
## [1] 360   5
str(BPRS_l)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
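For reference, the numeric week variable was derived from the character weeks variable during data wrangling; one common way to do this (a sketch, not necessarily the exact wrangling code):

```r
weeks <- c("week0", "week4", "week8")            # values as in BPRS_l$weeks
week_num <- as.integer(gsub("week", "", weeks))  # strip the prefix, keep the number
week_num  # 0 4 8
```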
#transforming treatment and subject to factors (again)
BPRS_l$treatment <- factor(BPRS_l$treatment)
BPRS_l$subject <- factor(BPRS_l$subject)
head(BPRS_l,30) 
##    treatment subject weeks bprs week
## 1          1       1 week0   42    0
## 2          1       2 week0   58    0
## 3          1       3 week0   54    0
## 4          1       4 week0   55    0
## 5          1       5 week0   72    0
## 6          1       6 week0   48    0
## 7          1       7 week0   71    0
## 8          1       8 week0   30    0
## 9          1       9 week0   41    0
## 10         1      10 week0   57    0
## 11         1      11 week0   30    0
## 12         1      12 week0   55    0
## 13         1      13 week0   36    0
## 14         1      14 week0   38    0
## 15         1      15 week0   66    0
## 16         1      16 week0   41    0
## 17         1      17 week0   45    0
## 18         1      18 week0   39    0
## 19         1      19 week0   24    0
## 20         1      20 week0   38    0
## 21         2       1 week0   52    0
## 22         2       2 week0   30    0
## 23         2       3 week0   65    0
## 24         2       4 week0   37    0
## 25         2       5 week0   59    0
## 26         2       6 week0   30    0
## 27         2       7 week0   69    0
## 28         2       8 week0   62    0
## 29         2       9 week0   38    0
## 30         2      10 week0   65    0
tail(BPRS_l,20) #looks good
##     treatment subject weeks bprs week
## 341         2       1 week8   50    8
## 342         2       2 week8   20    8
## 343         2       3 week8   32    8
## 344         2       4 week8   23    8
## 345         2       5 week8   35    8
## 346         2       6 week8   21    8
## 347         2       7 week8   35    8
## 348         2       8 week8   66    8
## 349         2       9 week8   21    8
## 350         2      10 week8   39    8
## 351         2      11 week8   75    8
## 352         2      12 week8   23    8
## 353         2      13 week8   28    8
## 354         2      14 week8   27    8
## 355         2      15 week8   37    8
## 356         2      16 week8   22    8
## 357         2      17 week8   43    8
## 358         2      18 week8   30    8
## 359         2      19 week8   21    8
## 360         2      20 week8   23    8
#loading the RATS_l into R
RATS_l <- read.csv("RATS_l.csv", row.names=1)

dim(RATS_l)
## [1] 176   5
glimpse(RATS_l)
## Rows: 176
## Columns: 5
## $ ID     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,...
## $ Group  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, ...
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, ...
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, ...
#transforming the ID and Group variables to factors:
RATS_l$ID <- factor(RATS_l$ID)
RATS_l$Group <- factor(RATS_l$Group)
head(RATS_l,30) 
##    ID Group  WD Weight Time
## 1   1     1 WD1    240    1
## 2   2     1 WD1    225    1
## 3   3     1 WD1    245    1
## 4   4     1 WD1    260    1
## 5   5     1 WD1    255    1
## 6   6     1 WD1    260    1
## 7   7     1 WD1    275    1
## 8   8     1 WD1    245    1
## 9   9     2 WD1    410    1
## 10 10     2 WD1    405    1
## 11 11     2 WD1    445    1
## 12 12     2 WD1    555    1
## 13 13     3 WD1    470    1
## 14 14     3 WD1    535    1
## 15 15     3 WD1    520    1
## 16 16     3 WD1    510    1
## 17  1     1 WD8    250    8
## 18  2     1 WD8    230    8
## 19  3     1 WD8    250    8
## 20  4     1 WD8    255    8
## 21  5     1 WD8    260    8
## 22  6     1 WD8    265    8
## 23  7     1 WD8    275    8
## 24  8     1 WD8    255    8
## 25  9     2 WD8    415    8
## 26 10     2 WD8    420    8
## 27 11     2 WD8    445    8
## 28 12     2 WD8    560    8
## 29 13     3 WD8    465    8
## 30 14     3 WD8    525    8
tail(RATS_l,20) #looks good
##     ID Group   WD Weight Time
## 157 13     3 WD57    518   57
## 158 14     3 WD57    544   57
## 159 15     3 WD57    555   57
## 160 16     3 WD57    553   57
## 161  1     1 WD64    278   64
## 162  2     1 WD64    245   64
## 163  3     1 WD64    269   64
## 164  4     1 WD64    275   64
## 165  5     1 WD64    280   64
## 166  6     1 WD64    281   64
## 167  7     1 WD64    284   64
## 168  8     1 WD64    278   64
## 169  9     2 WD64    478   64
## 170 10     2 WD64    496   64
## 171 11     2 WD64    472   64
## 172 12     2 WD64    628   64
## 173 13     3 WD64    525   64
## 174 14     3 WD64    559   64
## 175 15     3 WD64    548   64
## 176 16     3 WD64    569   64

Analysing the RATS data (following Chapter 8 of MABS)

The book describes the data as follows: the RATS data come from a larger "nutrition study conducted in three groups of rats (Crowder and Hand, 1990). The three groups were put on different diets, and each animal’s body weight (grams) was recorded repeatedly (approximately weekly, except in week seven when two recordings were taken) over a 9-week period. The question of most interest is whether the growth profiles of the three groups differ."

# A summary table of the RATS data:

#Print out a summary table of all variables
my_table <- tbl_summary(RATS_l,
                        digits = all_continuous() ~ 2,
                        type = all_continuous() ~ "continuous2",
                        statistic = list(all_continuous() ~ c("{mean} ({sd})",
                                                              "{median} ({p25}, {p75})", 
                                                              "{min}, {max}")))

#Add some missing elements to the table and print it out
my_table %>% bold_labels()
Characteristic          N = 176
ID (1–16)               11 (6.2%) each
Group                   1: 88 (50%)   2: 44 (25%)   3: 44 (25%)
WD (WD1–WD64)           16 (9.1%) per measurement day
Weight
  Mean (SD)             384.48 (127.16)
  Median (IQR)          344.50 (267.00, 511.25)
  Range                 225.00, 628.00
Time
  Mean (SD)             33.55 (19.51)
  Median (IQR)          36.00 (15.00, 50.00)
  Range                 1.00, 64.00
#other data examinations are done above (and in the separate R script).

#First, let's plot the weight values of all 16 rats, differentiating between the three groups.

ggplot(RATS_l, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(name = "Weights in grams", limits = c(min(RATS_l$Weight), max(RATS_l$Weight))) + theme_bw()

# Next, let's make the same plot but with standardized weight values.

#standardizing the weight measures, i.e., subtracting the relevant occasion mean from each original observation and dividing it by the corresponding standard deviation.

RATS_ls <- RATS_l %>%
  group_by(Time) %>%
  mutate(std_Weight = (Weight - mean(Weight))/sd(Weight)) %>%
  ungroup()

glimpse(RATS_ls) #looks ok.
## Rows: 176
## Columns: 6
## $ ID         <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ Group      <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1,...
## $ WD         <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", ...
## $ Weight     <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 5...
## $ Time       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8,...
## $ std_Weight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.88190...
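A quick sanity check of this kind of per-occasion standardization: within each Time the standardized values should have mean 0 and sd 1. A base-R sketch with toy data (ave() applies the function within each group):

```r
w   <- c(240, 225, 245, 410, 405, 445)   # weights at two occasions
occ <- rep(c(1, 8), each = 3)            # measurement occasion of each value
z <- ave(w, occ, FUN = function(x) (x - mean(x)) / sd(x))
tapply(z, occ, mean)  # ~0 within each occasion
tapply(z, occ, sd)    # 1 within each occasion
```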
# Plotting the standardized weights per group
ggplot(RATS_ls, aes(x = Time, y = std_Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized Weights") + theme_bw()

#

Interpreting the plots

  • The plots indicate one outlier in each group, most clearly in group 2.
  • Without any formal testing, group 1 clearly has the lowest weights, while rats in groups 2 and 3 are closer to each other in weight. Group 3 may have somewhat heavier rats than group 2, but this is hard to judge without confidence intervals.
  • The tracking phenomenon can also be seen: rats that are heavier at the start tend to remain heavier throughout the measurement period.
  • The plot of the standardized values is overall a bit easier to interpret, or at least clearer. The weight-level differences between groups remain quite clear, but differences in slopes are harder to tell apart.

Summary analyses

#First, let's create a graph with group means of the rats' weights with standard error of the mean (se) indicating the variation at each measurement point.
#
#to calculate the se we need an n; here n is taken to be the number of measurement occasions:
n <- RATS_l$Time %>% unique() %>% length()
n #11 measurement occasions, as expected.
## [1] 11
# Creating a summary data with mean and se of weight by group and measurement time 
RATS_s <- RATS_l %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
  ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
glimpse(RATS_s) #mean and se of weight values in each group for each measurement time.
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 3...
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375...
## $ se    <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3...
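Note that the se above divides by the number of measurement occasions (11); the standard error of a group mean at a given time point is more naturally based on the number of rats in that group (8, 4 and 4 here). A base-R sketch of the group-wise calculation with toy data:

```r
toy_w <- c(240, 225, 245, 410, 405)   # weights at one occasion
toy_g <- factor(c(1, 1, 1, 2, 2))     # group of each rat
m  <- tapply(toy_w, toy_g, mean)
se <- tapply(toy_w, toy_g, function(x) sd(x) / sqrt(length(x)))  # n = rats per group
m   # group means at this occasion
```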
#Let's plot these mean weight profiles
ggplot(RATS_s, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.9,0.5)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)") + scale_x_continuous(name="time(days)") + theme_classic()

# Let's visualize the same group means also with a bar plot (with error bars).

ggplot(RATS_l, aes(x=Time, y=Weight,colour=Group, fill=Group)) + 
   stat_summary(fun = mean, geom = "bar", position="dodge", width=3) + 
   stat_summary(fun.data = mean_se, geom = "errorbar", position="dodge", width=3) + theme_bw()
## Warning: Computation failed in `stat_summary()`:
## Can't convert a double vector to function
## Warning: position_dodge requires non-overlapping x intervals

The plots with error bars (standard errors of the means) confirm that group 1 differs from the other groups, and show that groups 2 and 3 differ from each other at first but then get closer, with the error bars nearly touching (or actually overlapping a bit!) at the end of the measurement period. Thus, based on the plots, these two groups may differ from each other, at least in overall means. However, the difference is not big and seems to mostly disappear by the end.

Since the question of most interest is “whether the growth profiles of the three groups differ”, it would perhaps be most informative to examine a summary statistic where the first group mean of rat weight is subtracted from the final group mean in each group, or to examine with regression whether the rate of change in body weight differs between the groups. However, let’s take the easiest route for now and draw one summary graph with overall group means. We can account for the starting weight later with, for instance, Ancova (analysis of covariance).

# First, creating summaries of data by Group and ID with mean as the summary variable.
#
RATS_s2 <- RATS_l %>%
  group_by(Group, ID) %>%
  summarise(mean=mean(Weight)) %>%
  ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
glimpse(RATS_s2)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 2...
#Drawing a boxplot of the mean versus treatment
ggplot(RATS_s2, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(weight), weeks 1-9")

# Let's remove the most notable outlier, namely the one in Group 2. 
# Creating a new data by filtering this one outlier 
RATS_s2_no <- RATS_s2 %>% filter(mean<570)

dim(RATS_s2_no)
## [1] 15  3
#Let's draw the above plot again without the outlier
ggplot(RATS_s2_no, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(weight), weeks 1-9")
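Instead of the hand-picked cut-off (mean < 570), outliers can also be flagged with the usual 1.5 × IQR rule; a base-R sketch on toy values:

```r
vals <- c(250, 255, 260, 265, 270, 275, 590)   # one clear outlier
boxplot.stats(vals)$out                        # values beyond the 1.5*IQR whiskers: 590
```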

The first boxplots show three outliers, the one in group 2 being the most notable. Indeed, in the first plot the weight values are not normally distributed in group 2, and they also appear at least somewhat skewed in group 3.

Removing the outlier from group 2 makes a big difference: the variance in that group is now much smaller and the weights seem normally distributed. It would likely have been best to also remove the other outliers, particularly the one in group 3. I realized this a bit too late…

The boxplots also clearly show group differences (note, though, that they are based on overall means across all measurement occasions).

Examining the group differences with a few tests

# First, let's apply an analysis of variance (Anova) to examine the groups' differences. Let's use the data without the one outlier that was removed. 

res.aov <- aov(mean ~ Group, data = RATS_s2_no)
# Summary of the analysis
summary(res.aov)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Group        2 206390  103195   483.6 3.39e-12 ***
## Residuals   12   2561     213                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Some of the groups differ (significant F-test value at p <.001)

#multiple comparisons, Bonferroni corrected p-value
pairwise.t.test(RATS_s2_no$mean, RATS_s2_no$Group, p.adj = "bonf")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  RATS_s2_no$mean and RATS_s2_no$Group 
## 
##   1       2      
## 2 8.7e-10 -      
## 3 4.7e-12 5.4e-05
## 
## P value adjustment method: bonferroni
#The p-values show that all groups differ from each other in mean weight (Bonferroni-corrected p < .001)


# Let's take into account the first mean weight values in each group and run an Ancova.

#First, computing the summary data without the first measurement value.

RATS_s3 <- RATS_l %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight)) %>%
  ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
#removing the one outlier
RATS_s3_no <- RATS_s3 %>% filter(mean<570)

#loading in the original data in a wide format
RATS_wide <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt",sep  ="\t",header=TRUE)

# Adding the baseline weights from the original data as a new variable to the summary data

#baseline <- RATS_wide$WD1
#RATS_s3_no <- RATS_s3_no %>%
  #mutate(baseline) # does not work since the row counts differ (one outlier was removed); the DataCamp code didn't work for this either

# Fitting a linear model with the mean as the response 
fit <- lm(mean ~ Group, data = RATS_s2_no) #should add baseline but can't now as it's not in the data.

# Printing the summary of the fitted model, then its analysis of variance table with anova()
summary(fit)
## 
## Call:
## lm(formula = mean ~ Group, data = RATS_s2_no)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.886  -3.080   3.273   9.250  14.386 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  263.716      5.165   51.06 2.09e-15 ***
## Group2       185.739      9.890   18.78 2.90e-10 ***
## Group3       262.080      8.945   29.30 1.56e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.61 on 12 degrees of freedom
## Multiple R-squared:  0.9877, Adjusted R-squared:  0.9857 
## F-statistic: 483.6 on 2 and 12 DF,  p-value: 3.387e-12
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Group      2 206390  103195   483.6 3.387e-12 ***
## Residuals 12   2561     213                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#The ANOVA (without baseline) and the regression show the same results as above. I'm interested to see how the baseline should have been added!
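As a hedged sketch of how the baseline could have been added: joining by `ID` instead of `mutate()` sidesteps the row-count mismatch caused by removing the outlier (the names `baseline_df`, `RATS_s3_b`, and `fit_b` are mine):

```r
# Attach the week-1 weight (WD1) from the wide data by ID; a join keeps
# the rows aligned even though the outlier row was filtered out.
library(dplyr)

baseline_df <- RATS_wide %>%
  transmute(ID = as.factor(ID), baseline = WD1)

RATS_s3_b <- RATS_s3_no %>%
  left_join(baseline_df, by = "ID")

# ANCOVA: the mean summary as response, baseline weight as covariate
fit_b <- lm(mean ~ baseline + Group, data = RATS_s3_b)
anova(fit_b)
```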

A note from the book: “Baseline measurements of the outcome variable in a longitudinal study are often correlated with the chosen summary measure and using such measures in the analysis can often lead to substantial gains in precision when used appropriately as a covariate in an analysis of covariance.”

Analysing the BPRS data (following Chapter 9 of MABS)

A short description of the data, taken from the book: The data is about “40 male subjects who were randomly assigned to one of two treatment groups and each subject was rated on the brief psychiatric rating scale (BPRS) measured before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.” I am assuming that higher score = higher chance of schizophrenia, or at least more schizophrenic symptoms.

#First, let's ignore the repeated-measures in the data and treat each observation as independent.

#Print out a summary table of all variables (tbl_summary() comes from the gtsummary package)
library(gtsummary)
my_table2 <- tbl_summary(BPRS_l,
                        digits = all_continuous() ~ 2,
                        type = all_continuous() ~ "continuous2",
                        statistic = list(all_continuous() ~ c("{mean} ({sd})",
                                                              "{median} ({p25}, {p75})", 
                                                              "{min}, {max}")))

#Add some missing elements to the table and print it out
my_table2 %>% bold_labels()
Characteristic              N = 360
treatment: 1 / 2            180 (50%) / 180 (50%)
subject: 1-20               18 (5.0%) each
weeks: week0-week8          40 (11%) each
bprs    Mean (SD)           37.66 (13.66)
        Median (IQR)        35.00 (27.00, 43.00)
        Range               18.00, 95.00
week: 0-8                   40 (11%) each
head(BPRS_l,41) #The summary table suggests there are 20 subjects, but there are actually 40: the ID numbers run from 1 to 20 in both treatment groups, and the groups contain different people!
##    treatment subject weeks bprs week
## 1          1       1 week0   42    0
## 2          1       2 week0   58    0
## 3          1       3 week0   54    0
## 4          1       4 week0   55    0
## 5          1       5 week0   72    0
## 6          1       6 week0   48    0
## 7          1       7 week0   71    0
## 8          1       8 week0   30    0
## 9          1       9 week0   41    0
## 10         1      10 week0   57    0
## 11         1      11 week0   30    0
## 12         1      12 week0   55    0
## 13         1      13 week0   36    0
## 14         1      14 week0   38    0
## 15         1      15 week0   66    0
## 16         1      16 week0   41    0
## 17         1      17 week0   45    0
## 18         1      18 week0   39    0
## 19         1      19 week0   24    0
## 20         1      20 week0   38    0
## 21         2       1 week0   52    0
## 22         2       2 week0   30    0
## 23         2       3 week0   65    0
## 24         2       4 week0   37    0
## 25         2       5 week0   59    0
## 26         2       6 week0   30    0
## 27         2       7 week0   69    0
## 28         2       8 week0   62    0
## 29         2       9 week0   38    0
## 30         2      10 week0   65    0
## 31         2      11 week0   78    0
## 32         2      12 week0   38    0
## 33         2      13 week0   63    0
## 34         2      14 week0   40    0
## 35         2      15 week0   40    0
## 36         2      16 week0   54    0
## 37         2      17 week0   33    0
## 38         2      18 week0   28    0
## 39         2      19 week0   52    0
## 40         2      20 week0   47    0
## 41         1       1 week1   36    1
str(BPRS_l)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
#Let's recode the subject values so that every individual has a unique ID. Maybe this helps later on with some errors I got in the plots (it did, yay!)
BPRS_l_test <- BPRS_l %>%
  mutate(subject = ifelse(as.character(treatment) == "2", recode(as.character(subject), "1" = 21, "2" = 22, "3" = 23, "4" = 24, "5" = 25, "6" = 26, "7" = 27, "8" =28, "9"=29, "10"=30, "11"=31, "12"=32, "13"=33, "14"=34, "15"=35, "16"=36, "17"=37, "18"=38, "19"=39, "20"=40), subject))
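An alternative, shorter way to build the unique IDs would be arithmetic rather than value-by-value recoding; a sketch assuming `subject` is a factor with levels "1"-"20" (the name `BPRS_l_alt` is mine):

```r
# Subjects in treatment group 2 get IDs 21-40 by adding 20.
library(dplyr)

BPRS_l_alt <- BPRS_l %>%
  mutate(subject = factor(as.integer(as.character(subject)) +
                            20 * (treatment == "2")))
```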


#Let's start by plotting the data. Let's note group membership but ignore the longitudinal nature of the data for now.

# Drawing the measures against week plots
#
#First, let's make a scatterplot. And add some jitter to maybe see the overlapping observations a bit better.
p1 <- ggplot(BPRS_l, aes(x = week, y = bprs, group = subject))
p2 <- p1 + geom_text(aes(label = treatment), position=position_jitter(width=0.15,height=0.1))
p3 <- p2 + scale_x_continuous(name = "Week", breaks = seq(0, 8, 1))
p4 <- p3 + scale_y_continuous(name = "Measure")
p5 <- p4 + theme_bw()
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6

# Still ignoring the repeated-measures structure of the data, let's fit a multiple linear regression model where bprs is response and week and treatment are explanatory variables.

# create a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data=BPRS_l)

# print out a summary of the model
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS_l)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16
#Let's NOT include a week*treatment interaction term in this model:
#it would not be smart, since the data contain repeated measures, i.e. the data points are not independent!
#BPRS_reg2 <- lm(bprs ~ week * treatment, data=BPRS_l)

The two treatment groups overlap heavily in the scatter plot, i.e. there do not seem to be clear differences between them.

The linear (regression) model supports this view, since treatment2 does not differ significantly from treatment1 (p > 0.1). Week has a negative effect on symptoms (p < .001), but we need to remember that the data points are not actually independent of each other (they are repeated measurements from the same individuals).

Moving on to more appropriate plots and analyses for repeated-measures, i.e. longitudinal, data

#Making a line plot that accounts for the longitudinal structure of the data by connecting the measurements of each individual.

p7 <- ggplot(BPRS_l_test, aes(x = week, y = bprs, group = subject))
p8 <- p7 + geom_line(aes(colour = treatment))
p9 <- p8 + scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1))
p10 <- p9 + scale_y_continuous(name = "Measure(bprs)")
p11 <- p10 + theme_bw() + theme(legend.position = "top")
p12 <- p11 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p12

#same plot but with linetype, not colour indicating treatment groups.
ggplot(BPRS_l_test, aes(x = week, y = bprs, group = subject)) +
  geom_line(aes(linetype=treatment)) + scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
  scale_y_continuous(name = "Measure (bprs)") + theme(legend.position = "top")

#Let's make an additional plot.
ggplot(BPRS_l, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRS_l$bprs), max(BPRS_l$bprs)))

#Well, this is not any easier to interpret due to the large number of subjects. Differences are easier to see in the earlier graphs, where everyone is in the same plot.

The plots show that, indeed, the individuals in the two groups do not form clear clusters with regard to symptom change. There also seems to be at least one outlier (in treatment group 2) in the data. The plots do show the overall downward trend of the symptoms, though. The variability in symptoms also narrows over time, with a smaller range of scores at 8 weeks. The faceted plot further indicates that there may be somewhat more symptom variability in group 2, even ignoring the one outlier in that group.

#loading in the data in a wide format:
BPRS_wide <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt",sep  =" ",header=TRUE)

#Let's plot the bprs-values per week
pairs(BPRS_wide[,3:11], cex = 0.7)

#The plot shows that the measurements are related; there is an observable trend in many of the panels.

#Let's plot the bprs-values per week with treatment group coloured
pairs(BPRS_wide[,2:11], col=BPRS_wide$treatment, cex = 0.7)

library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
#First, let's fit a random intercept model and include the two explanatory variables: week and treatment.

#Fitting the random intercept model with the subject as the random effect
BPRS_rim <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS_l, REML = FALSE)

# Print the summary of the model
summary(BPRS_rim)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRS_l
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000
#The treatment group has no notable effect on symptoms (t value 0.53).
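Since `lmer()` reports no p-values, one way to back up this reading is to compute profile 95% confidence intervals for the fixed effects; `parm = "beta_"` restricts `confint()` to the fixed effects:

```r
# Profile confidence intervals for the fixed effects of the random
# intercept model; an interval for treatment2 covering zero supports
# the conclusion that treatment has no clear effect.
library(lme4)

confint(BPRS_rim, parm = "beta_")
```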

#Next, let's fit a random intercept AND random slope model:
#a random intercept plus a random slope for week, grouped by subject
BPRS_riasm <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS_l, REML = FALSE)

# print a summary of the model
summary(BPRS_riasm)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRS_l
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
# performing an ANOVA test to compare the two models
anova(BPRS_riasm, BPRS_rim)
## Data: BPRS_l
## Models:
## BPRS_rim: bprs ~ week + treatment + (1 | subject)
## BPRS_riasm: bprs ~ week + treatment + (week | subject)
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_rim      5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_riasm    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The model that also has a random slope for each individual predicts symptom change better, i.e. fits the data better. This can be seen from the chi-squared test and the AIC (smaller is better); note, though, that the BIC actually slightly favours the simpler model, and the difference between the models is modest (p = .026).


# Fitting a random intercept and slope model with a treatment × week interaction. That is, making the same model as previously but with interaction.
BPRS_riasmi <- lmer(bprs ~ week * treatment + (week | subject), data = BPRS_l, REML = FALSE)

summary(BPRS_riasmi)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRS_l
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
#comparing the last two models with Anova
# Computing the analysis of variance tables of the two models
anova(BPRS_riasmi, BPRS_riasm)
## Data: BPRS_l
## Models:
## BPRS_riasm: bprs ~ week + treatment + (week | subject)
## BPRS_riasmi: bprs ~ week * treatment + (week | subject)
##             npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_riasm     7 2745.4 2772.6 -1365.7   2731.4                       
## BPRS_riasmi    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The interaction does not significantly improve the fit (p = .075), although the AIC is slightly smaller for the interaction model.

# Making a plot of BPRS_l with the observed bprs values
ggplot(BPRS_l_test, aes(x = week, y = bprs, group = subject)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
  scale_y_continuous(name = "Score (bprs)") +
  theme(legend.position = "top")

# Creating a vector of the fitted values
Fitted <- fitted(BPRS_riasmi)

# Creating a new column fitted to BPRS_l_test2
BPRS_l_test2 <- BPRS_l_test %>%
  mutate(Fitted)

# Drawing the plot of BPRS with the fitted values of bprs
# (using colour = treatment instead of linetype)
ggplot(BPRS_l_test2, aes(x = week, y = Fitted, group = subject)) +
  geom_line(aes(colour = treatment)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
  scale_y_continuous(name = "Fitted bprs") +
  theme(legend.position = "top")

#Let's plot the observed and fitted values next to each other:
px1 <- ggplot(BPRS_l_test, aes(x = week, y = bprs, group = subject))
px2 <- px1 + geom_line(aes(linetype = treatment))
px3 <- px2 + scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1))
px4 <- px3 + scale_y_continuous(name = "bprs")
px5 <- px4 + theme_bw() + theme(legend.position = "right")
px6 <- px5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
px7 <- px6 + ggtitle("Observed")
graph1 <- px7
py1 <- ggplot(BPRS_l_test2, aes(x = week, y = Fitted, group = subject))
py2 <- py1 + geom_line(aes(linetype = treatment))
py3 <- py2 + scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1))
py4 <- py3 + scale_y_continuous(name = "bprs")
py5 <- py4 + theme_bw() + theme(legend.position = "right")
py6 <- py5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
py7 <- py6 + ggtitle("Fitted")
graph2 <- py7
graph1; graph2

#Again, no clear differences between the groups, although the fitted lines show the slight slope difference captured by the interaction term.